Problem 2: Non-Conjugate Inference

In the following we will implement, from scratch, Bayesian non-conjugate inference. The standard computational approach is to resort to Markov Chain Monte Carlo (MCMC) in order to sample from the posterior. In order to stress that MCMC is only one computational approach, although arguably the most appropriate for most real-world scenarios, in this notebook we will explore a few alternative numerical methods too.

The numerical challenge we will consider is the computation of the mean or expected value of a posterior distribution, $\mathbb{E}[\theta | \underline{Y} = \underline{y}]$. The posterior mean is an important quantity to know and, in applied problems, often has a real-world interpretation.

The numerical methods we consider are:

  1. Quadrature (the trapezium rule).
  2. Monte-Carlo integration (using samples from the prior).
  3. Rejection sampling.
  4. Markov Chain Monte Carlo (the Metropolis-Hastings algorithm).

Since we are working from scratch, we will implement each of these numerical methods ourselves. We will also ensure that the Bayesian inference problem we consider is sufficiently simple to be amenable to all of them.

Notebook settings

The following changes the default plot size for the notebook:
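The original settings cell is not shown; a minimal sketch of what it might contain (the exact figure size is an assumption) is:

```python
import matplotlib.pyplot as plt

# Enlarge the default figure size for all subsequent plots in the notebook
# (the specific (width, height) values here are a guess at the stripped cell).
plt.rcParams["figure.figsize"] = (10, 6)
```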

Loading Libraries

The following loads the required packages to run all the code in this notebook:
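The original imports cell is not shown; assuming the notebook relies on NumPy and matplotlib, it might look like:

```python
import numpy as np               # numerical arrays and random number generation
import matplotlib.pyplot as plt  # plotting

# A seeded generator keeps the notebook reproducible (seed choice is arbitrary).
rng = np.random.default_rng(42)
```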

The Inference Problem

The particular inference problem we consider is a one-dimensional parameter problem: inferring the rate parameter $\lambda$ of an Exponential distribution $\text{Exp}(\lambda)$.

Data Generation

Similar to the last problem, we will use artificial data.
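The data-generation cell is not shown; a sketch of what it might look like, assuming a true rate of $2$ and $30$ observations (both values are illustrative), is:

```python
import numpy as np

rng = np.random.default_rng(0)

true_rate = 2.0  # the lambda value the artificial data is generated from (assumed)
n = 30           # the amount of data (assumed)

# Exp(lambda) has mean 1/lambda; numpy parameterises the exponential by scale = 1/rate.
y_data = rng.exponential(scale=1.0 / true_rate, size=n)
```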

Defining the Likelihood

The density function of an exponentially distributed random variable $Y_i \sim \text{Exp}(\lambda)$ takes the form

$$ p(y_i|\lambda) = \lambda \exp\{-\lambda y_i\}.$$

Implementing this as a log density:

$$ \log p(y_i|\lambda) = \log(\lambda) - \lambda y_i.$$
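The implementation cell is not shown; a direct translation of the log-density formula (the function name is an assumption) is:

```python
import numpy as np

def log_exp_density(y, rate):
    """Log-density of Exp(rate): log p(y | lambda) = log(lambda) - lambda * y."""
    return np.log(rate) - rate * y
```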

Defining the full likelihood

Assuming $n$ IID observations, the full likelihood function takes the form

$$ p(\underline{y}|\lambda) = \prod_{i=1}^{n} \lambda \exp\{-\lambda y_i\}. $$

The log-likelihood then takes the form:

$$ \log p(\underline{y}|\lambda) = \sum_{i=1}^{n} \log(\lambda) -\lambda y_i. $$
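A sketch of the log-likelihood implementation, summing the per-observation terms exactly as in the formula above (the function name is an assumption):

```python
import numpy as np

def log_likelihood(y, rate):
    """Log-likelihood of n IID Exp(lambda) observations:
    sum_i [log(lambda) - lambda * y_i] = n log(lambda) - lambda * sum_i y_i."""
    y = np.asarray(y)
    return y.size * np.log(rate) - rate * y.sum()
```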

Defining the Prior

For the sake of ensuring lack of conjugacy, we will take our prior to follow a log-normal distribution. A log-normal distribution is the distribution of $\exp(X)$, whenever $X$ is normally distributed. That is, if $\lambda \sim \text{LogNormal}(\mu,\sigma^2)$, then $\log(\lambda) \sim \mathcal{N}(\mu,\sigma^2)$. Hence the name "log-normal".

Since it is the exponential of a normal distribution, a log-normally distributed random variable is guaranteed to be positive. This is helpful since we know, a priori, that the rate parameter $\lambda$ is greater than $0$.

We take the prior $\lambda \sim \text{LogNormal}(\mu, \sigma^2)$. The density and log-density are available in closed form:

$$ p(\lambda) = \frac{1}{\lambda\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\log(\lambda) - \mu}{\sigma}\right)^2\right),$$

$$ \log p(\lambda) = -\log \lambda - \log\left(\sigma\sqrt{2\pi}\right) -\frac{1}{2}\left(\frac{\log(\lambda) - \mu}{\sigma}\right)^2.$$
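A sketch of the log-prior implementation, following the closed-form log-density term by term (the function name and parameterisation by `(mu, sd)` are assumptions):

```python
import numpy as np

def log_prior(rate, mu, sd):
    """Log-density of LogNormal(mu, sd^2) evaluated at lambda = rate > 0."""
    z = (np.log(rate) - mu) / sd
    return -np.log(rate) - np.log(sd * np.sqrt(2.0 * np.pi)) - 0.5 * z * z
```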

Plotting our Prior

The Bayesian Update

In the following implementation, dposterior evaluates the unnormalised posterior

$$ \tilde{p}(\lambda | \underline{y}) = p(\underline{y}|\lambda)p(\lambda), $$

where the data used, y_data, and prior parameters, prior_params, are kept as arguments.
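The implementation cell is not shown; a self-contained sketch of `dposterior`, working in log-space for numerical stability and exponentiating at the end (the inlined likelihood and prior formulas follow the earlier sections), is:

```python
import numpy as np

def dposterior(rate, y_data, prior_params):
    """Unnormalised posterior p(y | lambda) p(lambda) for the Exp likelihood
    and LogNormal(mu, sd^2) prior, evaluated at lambda = rate."""
    mu, sd = prior_params
    y = np.asarray(y_data)
    rate = np.asarray(rate, dtype=float)
    # Exp(lambda) log-likelihood: n log(lambda) - lambda * sum(y)
    log_lik = y.size * np.log(rate) - rate * y.sum()
    # LogNormal(mu, sd^2) log-prior
    z = (np.log(rate) - mu) / sd
    log_pri = -np.log(rate) - np.log(sd * np.sqrt(2.0 * np.pi)) - 0.5 * z * z
    return np.exp(log_lik + log_pri)
```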

Plotting the Unnormalised Posterior

In order to investigate the sensitivity of the Bayesian update to the amount of data and prior specification, let's condition on different data and use different prior parameters.

Feel free to vary the variables true_rate (the actual $\lambda$ value the data is generated from), n (the amount of data), mu and sd (parameters controlling the prior) and see how they affect the Bayesian updating.

Let's plot the likelihood, prior and the posterior altogether:

The Numerical Challenge: Computing The Posterior Mean

The mean of the posterior distribution is an important quantity and is the integral:

$$ \mathbb{E}[\theta | \underline{y}] = \int_{\Theta} \theta p(\theta | \underline{y})\, \mathrm{d}\theta. $$

Since we are in the non-conjugate setting, the computation of this integral also requires the computation of the normalisation constant (denoted $C$):

$$ \mathbb{E}[\theta | \underline{y}] = \int_{\Theta} \theta \frac{p(\underline{y}|\theta)p(\theta)}{\int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta }\, \mathrm{d}\theta = \underbrace{\frac{1}{\int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta}}_{1/C}\int_{\Theta} \theta \underbrace{p(\underline{y}|\theta)p(\theta)}_{\text{Unnormalised Posterior}} \, \mathrm{d}\theta. $$

Here, we just used the definition of the posterior distribution through Bayes formula and moved the normalisation constant outside the integral (since it is a constant).

We will soon see that if we are able to sample from the posterior distribution, we can estimate the posterior mean without computing the normalisation constant. Some sampling methods, such as Markov Chain Monte Carlo (MCMC) methods, are able to sample from $p(\theta|\underline{y})$ without the value of $C$ being given.

In the following, we will compare a few approaches for the computation of the posterior mean of the non-conjugate Bayesian inference problem we specified above.

Quadrature: The Trapezium Rule

Our first approach of computing the posterior mean is by using a quadrature method known as the trapezium rule.

Suppose we want to compute the integral

$$ I = \int_a^b f(x) \,\mathrm{d}x.$$

For example, in the following, we are integrating the function

$$ f(x) = \exp(\sin(1/x)) $$

from $x = 0.1$ to $x = 0.4$.

Quadrature proceeds by approximating the integral using evaluations $f(x_1),\ldots, f(x_n)$ at locations $x_1,\ldots, x_n$ and constructing an interpolant:

The previous interpolation is linear; there are many other types of interpolation and methods for picking the locations $x_1,\ldots,x_n$. The associated quadrature algorithm is called the trapezium rule because the total approximating area is built from trapeziums:

The area of the trapezium with vertices $(x_i,0), (x_i, f(x_i)), (x_{i+1},f(x_{i+1})), (x_{i+1},0)$ is

$$ \frac{f(x_{i+1}) + f(x_i)}{2} (x_{i+1}-x_i).$$

Summing over all the individual trapezium areas yields the (composite) trapezium rule:

$$ \text{Trap}(f, a, b, n) = \frac{1}{2}\sum_{i=1}^{n-1} (f(x_{i+1}) + f(x_i))(x_{i+1}-x_i), $$

where $x_1 = a$ and $x_n = b$.

All standard integration routines in Python (such as scipy.integrate.quad) and in R provide implementations of quadrature algorithms similar in spirit to the trapezium rule, albeit more advanced and better in most situations! For more advanced examples of quadrature, see Simpson's rule or Gaussian quadrature.

Implementing the Trapezium Rule

Let's implement the trapezium rule! The following function trapezium_rule approximates the integral of a univariate function f from a to b using the trapezium rule over a uniform mesh of n evaluations (with default n=50 used).
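The implementation cell is not shown; a vectorised sketch matching the composite trapezium rule formula above is:

```python
import numpy as np

def trapezium_rule(f, a, b, n=50):
    """Approximate int_a^b f(x) dx using n evaluations on a uniform mesh."""
    x = np.linspace(a, b, n)
    fx = f(x)
    # Sum of trapezium areas: 0.5 * (f(x_{i+1}) + f(x_i)) * (x_{i+1} - x_i)
    return 0.5 * np.sum((fx[1:] + fx[:-1]) * (x[1:] - x[:-1]))
```

For the running example, `trapezium_rule(lambda x: np.exp(np.sin(1.0 / x)), 0.1, 0.4)` approximates the integral of $f(x) = \exp(\sin(1/x))$.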

To check our implementation is correct we can check our quadrature result agrees with Wolfram Alpha. Note that Wolfram Alpha uses a quadrature routine to approximate this integral too, since it is unavailable in closed form.

Using the Trapezium Rule to Compute the Posterior Mean

Recall that we want to compute the following

$$ \mathbb{E}[\theta | \underline{y}] = \int_{\Theta} \theta \frac{p(\underline{y}|\theta)p(\theta)}{\int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta }\, \mathrm{d}\theta = \underbrace{\frac{1}{\int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta}}_{1/C}\int_{\Theta} \theta \underbrace{p(\underline{y}|\theta)p(\theta)}_{\text{Unnormalised Posterior}} \, \mathrm{d}\theta. $$

Therefore, to compute the posterior mean, we need to evaluate two integrals:

$$ \text{Integral 1: } \int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta = \int_{0}^\infty p(\underline{y}|\lambda)p(\lambda)\,\mathrm{d}\lambda . $$

$$ \text{Integral 2: } \int_{\Theta} \theta p(\underline{y}|\theta)p(\theta) \, \mathrm{d}\theta = \int_{0}^{\infty} \lambda p(\underline{y}|\lambda)p(\lambda) \, \mathrm{d}\lambda . $$

The trapezium rule really only works for finite limits $a$ and $b$ ($\int_a^b f(x)\,\mathrm{d}x$). The lower limit will be $a=0$ and we will just use a large number $b \gg 0$ for the upper limit.

In order to use the trapezium rule, we need to implement the integrand of the second integral (the function $\lambda \mapsto \lambda p(\underline{y}|\lambda)p(\lambda)$). We also need to perform currying to turn our functions into the form that trapezium_rule expects; we will use the y_data and prior_params we defined earlier. In the following, integrand_1 is the integrand of the first integral (computation of the normalisation constant $C$), and integrand_2 is the integrand of the second integral. Note that integrand_2 is just the output of integrand_1 multiplied by $\lambda$:

We are now ready to use our trapezium rule methodology:
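The code cells are not shown; a self-contained sketch of the whole computation, assuming the synthetic setup from earlier (true rate $2$, $30$ observations, $\text{LogNormal}(0,1)$ prior, and an upper limit of $b = 20$, all illustrative choices), is:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate, n_obs = 2.0, 30
y_data = rng.exponential(scale=1.0 / true_rate, size=n_obs)
prior_params = (0.0, 1.0)  # (mu, sd) of the log-normal prior

def dposterior(rate, y, params):
    """Unnormalised posterior, evaluated via log-densities for stability."""
    mu, sd = params
    rate = np.asarray(rate, dtype=float)
    log_lik = y.size * np.log(rate) - rate * y.sum()
    z = (np.log(rate) - mu) / sd
    log_pri = -np.log(rate) - np.log(sd * np.sqrt(2.0 * np.pi)) - 0.5 * z * z
    return np.exp(log_lik + log_pri)

def trapezium_rule(f, a, b, n=50):
    x = np.linspace(a, b, n)
    fx = f(x)
    return 0.5 * np.sum((fx[1:] + fx[:-1]) * (x[1:] - x[:-1]))

# Currying: fix y_data and prior_params so each integrand depends on lambda only.
integrand_1 = lambda lam: dposterior(lam, y_data, prior_params)  # normalisation C
integrand_2 = lambda lam: lam * integrand_1(lam)                 # numerator integral

# Start just above 0 to avoid evaluating log(0); b = 20 stands in for infinity.
C = trapezium_rule(integrand_1, 1e-10, 20.0, n=2000)
posterior_mean = trapezium_rule(integrand_2, 1e-10, 20.0, n=2000) / C
```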

Monte-Carlo Integration

The remaining numerical methods we will explore are all instances of Monte-Carlo integration and so I thought it would be best to have a brief introduction to this. First off, let's clarify a potential confusion:

  1. Markov Chain Monte Carlo (MCMC) is a class of numerical methods that aim to sample from a given probability distribution $\mathbb{P}$ (e.g. a posterior distribution) using a Markov Chain.
  2. Monte-Carlo Integration is a class of numerical methods that rely on random samples to approximate integrals. The relationship with MCMC is that you are able to use the samples obtained from MCMC to perform Monte-Carlo integration.

Monte Carlo integration is a method of integration that relies on random samples and the law of large numbers. Suppose your integration problem can be written as an expectation

$$ \mathbb{E}[f(X)] = \int f(x) p(x)\,\mathrm{d}x, $$

where $X\sim \mathbb{P}$, with density function $p(x)$. If not, you can usually rewrite your integral as an expectation, using a "reweighting trick":

$$ \int g(x)\,\mathrm{d}x = \int \frac{g(x)}{p(x)} p(x)\, \mathrm{d}x = \mathbb{E}\left[\frac{g(X)}{p(X)}\right], $$

where $X \sim \mathbb{P}$ with density function $p(x)$. Then, for an IID sample $X_1,\ldots,X_n$, the arithmetic mean of the $f(X_i)$ forms an approximation to the true expectation:

$$ \mathbb{E}[f(X)] \approx \frac{1}{n}\sum_{i=1}^n f(X_i). $$

By the law of large numbers, this approximation improves as we draw more samples, and it converges to the true expectation as $n \to \infty$.

Example: Monte-Carlo Integration

Let's integrate the function used in the Trapezium rule example using Monte-Carlo. Recall the quantity of interest was the following

$$ I = \int_{0.1}^{0.4} \exp(\sin(1/x)) \mathrm{d}x. $$

In order to use Monte-Carlo integration we have to include a density function $p(x)$ so that $I$ takes the form of an expectation. The easiest way of doing this is to integrate against the density of the uniform distribution, $\text{Unif}(0.1,0.4)$. The density function is $p(x) = 1 / (0.4-0.1) = 10/3$ when $x \in [0.1,0.4]$ and $0$ elsewhere. In order to introduce this density function $p$ into the integrand of $I$, we also need to divide by it to maintain equality:

$$ I = \int_{0.1}^{0.4} \frac{\exp(\sin(1/x))}{p(x)} p(x) \mathrm{d}x = \frac{3}{10} \int_{0.1}^{0.4} \exp(\sin(1/x))\frac{10}{3} \mathrm{d}x = \frac{3}{10}\mathbb{E}\left[\exp(\sin(1/X))\right], $$

where $X\sim \text{Unif}(0.1,0.4)$.

Therefore, we can now approximate $I$ using Monte-Carlo integration, as

$$ I \approx \frac{3}{10} \left(\frac{1}{n}\sum_{i=1}^n \exp(\sin(1/X_i))\right), $$

where each $X_i \sim \text{Unif}(0.1,0.4)$.

Let's implement this and visualise what is going on:
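The implementation cell is not shown; a sketch of a generic `monte_carlo_integration` routine that also reports the estimated standard error (the return signature is an assumption) is:

```python
import numpy as np

def monte_carlo_integration(f, a, b, n=10_000, rng=None):
    """Estimate int_a^b f(x) dx as (b - a) * mean(f(X_i)) with X_i ~ Unif(a, b).

    Dividing by the Unif(a, b) density 1/(b - a) is what turns the integral
    into the expectation E[(b - a) f(X)].
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(a, b, size=n)
    fx = f(x)
    estimate = (b - a) * fx.mean()
    # Estimated standard error of the estimator: (b - a) * sd(f(X)) / sqrt(n)
    std_error = (b - a) * fx.std(ddof=1) / np.sqrt(n)
    return estimate, std_error
```

For the running example, `monte_carlo_integration(lambda x: np.exp(np.sin(1.0 / x)), 0.1, 0.4)` estimates $I$.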

In the above we have implemented Monte-Carlo integration to integrate a univariate function $f$ from $x=a$ to $x=b$ using $n$ uniform samples from the distribution $\text{Unif}(a,b)$.

Let's now visualise this, using our example function:

Using Monte-Carlo Integration to Compute the Posterior Mean

Recall we want to compute the following quantity

$$ \mathbb{E}[\theta | \underline{y}] = \int_{\Theta} \theta p(\theta|\underline{y})\, \mathrm{d}\theta . $$

In order to compute this in one use of Monte-Carlo integration, we need samples from the posterior distribution:

$$ \mathbb{E}[\theta | \underline{y}] \approx \frac{1}{n}\sum_{i=1}^n \theta_i, $$

where the $\theta_i$ are samples from $p(\theta|\underline{y})$.

In the following sections, we will explore two different methods of sampling from the posterior when we only have access to the unnormalised density: Rejection Sampling and Metropolis Hastings (an MCMC method).

Using Monte-Carlo Integration Twice to Compute the Posterior Mean

We could, of course, use our previous approach of Monte-Carlo integration twice to compute the normalising constant and the unnormalised posterior mean:

$$ \text{Integral 1: } \int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta = \int_{0}^\infty p(\underline{y}|\lambda)p(\lambda)\,\mathrm{d}\lambda . $$

$$ \text{Integral 2: } \int_{\Theta} \theta p(\underline{y}|\theta)p(\theta) \, \mathrm{d}\theta = \int_{0}^{\infty} \lambda p(\underline{y}|\lambda)p(\lambda) \, \mathrm{d}\lambda . $$

In this case, both integrals include the prior density $p(\theta)$ and thus are expectations (the $\theta$ term in the subscript of $\mathbb{E}_\theta$ means that $\theta$ is the random variable we are taking expectations with respect to):

$$ \text{Integral 1: } \int_\Theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta = \mathbb{E}_\theta[p(\underline{y}|\theta)], $$

$$ \text{Integral 2: } \int_\Theta \theta p(\underline{y}|\theta)p(\theta)\,\mathrm{d}\theta = \mathbb{E}_\theta[\theta p(\underline{y}|\theta)], $$

where $\theta\sim \text{LogNormal}(\mu, \sigma^2)$ is distributed according to the prior. We could, therefore, generate samples according to the prior and estimate these two quantities using Monte-Carlo integration. Let's do that:

Let's now estimate these quantities:

Integral 1:

Although we implemented the integrand of $\text{Integral 1}$ in the trapezium rule section as integrand_1, it included the prior density as a factor, which we no longer want. Since we now need to evaluate only the likelihood function, we will use the following instead:

Integral 2:

Let's do the same thing with the second integrand:
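The code cells are not shown; a self-contained sketch of the full prior-sampling estimate, assuming the same illustrative synthetic setup as before (true rate $2$, $30$ observations, $\text{LogNormal}(0,1)$ prior), is:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rate, n_obs, n_samples = 2.0, 30, 200_000
y_data = rng.exponential(scale=1.0 / true_rate, size=n_obs)
mu, sd = 0.0, 1.0  # prior parameters

def likelihood(rate):
    """p(y | lambda) for IID Exp(lambda) data, vectorised over rate."""
    rate = np.asarray(rate, dtype=float)
    return np.exp(y_data.size * np.log(rate) - rate * y_data.sum())

# Sampling from the prior: lambda = exp(X) with X ~ N(mu, sd^2).
prior_samples = np.exp(rng.normal(mu, sd, size=n_samples))

lik = likelihood(prior_samples)
integral_1 = lik.mean()                    # estimates E_theta[p(y | theta)] = C
integral_2 = (prior_samples * lik).mean()  # estimates E_theta[theta p(y | theta)]
posterior_mean = integral_2 / integral_1
```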

The main way of measuring the "error" of our Monte-Carlo estimator is the variance:

$$ \text{Var}\left[\frac{1}{n}\sum_{i=1}^n f(X_i)\right] = \frac{1}{n}\text{Var}[f(X)].$$

This is a measure of the variability of the Monte-Carlo estimator: how much do our integral estimates vary when we call our monte_carlo_integration functions. The variance is driven entirely by the variance of the random variable $f(X)$. In most situations, we are unable to compute this variance, but we are able to estimate it using the sample variance:

$$ \frac{1}{n}\text{Var}[f(X)] \approx \frac{1}{n}\left(\frac{1}{n-1} \sum_{i=1}^n \left(f(x_i) - \overline{f(\underline{x})}\right)^2\right), $$

where $\overline{f(\underline{x})} = \frac{1}{n}\sum_{i=1}^n f(x_i)$ is the sample mean of the $f(x_i)$ and each $x_i$ is a sample from the distribution of $X$.
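The variance estimate above can be sketched in a few lines for the running example integrand (the function and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate Var[(1/n) sum f(X_i)] = Var[f(X)] / n via the sample variance of f(X_i).
f = lambda x: np.exp(np.sin(1.0 / x))  # the example integrand
x = rng.uniform(0.1, 0.4, size=50_000)
fx = f(x)

estimator_variance = fx.var(ddof=1) / x.size  # (1/n) * sample variance of f(X_i)
estimator_std = np.sqrt(estimator_variance)   # reported alongside the estimate
```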

This sample standard deviation (the square root of the variance) of the Monte-Carlo estimator is what is reported in the previous Monte-Carlo integration plots.

Rejection Sampling

The first method we will explore to directly sample from the posterior distribution is rejection sampling. This methodology is really only practical for low-dimensional distributions, which is why MCMC is more generally used. A key advantage, however, is that the samples obtained through rejection sampling will be uncorrelated.

Suppose we have a positive function $g(x)$ (positive means $g(x) > 0$ for every $x$) that has a finite integral (an unnormalised probability density function, think the unnormalised posterior):

$$ c = \int_{-\infty}^\infty g(x) \,\mathrm{d}x.$$

Then the function $p(x) = g(x)/c$ is a valid probability density function, since $p(x) > 0$ and

$$ \int_{-\infty}^\infty p(x) \,\mathrm{d}x = \frac{1}{c}\int_{-\infty}^\infty g(x) \,\mathrm{d}x = \frac{c}{c} = 1. $$

Rejection sampling allows us to, given an unnormalised probability density function $g(x)$, generate samples distributed according to the density function $p(x) = g(x)/c$.

To properly understand rejection sampling, we first need to understand how to sample uniformly in a region enclosed by a function:

Sampling Uniformly in a Region

Suppose we have the following probability density function $p(x)$:

And suppose that we want to generate uniform samples in the enclosed blue region (the area beneath the density function $p(x)$). By this, we mean samples that look like:

Each sample above was generated by:

  1. First sampling an $x_i$ from the blue probability density function.
  2. Then sampling a $u_i$ from the uniform distribution $\text{Unif}(0, p(x_i))$.
  3. The pair $(x_i, u_i)$ is then uniformly distributed in the blue region enclosed by the density function.
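The three steps above can be sketched in code; here a standard normal stands in for the (unspecified) blue density from the plot:

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x):
    """Standard normal density, standing in for the plotted p(x)."""
    return np.exp(-0.5 * x * x) / np.sqrt(2.0 * np.pi)

# Step 1: sample x_i from the density p.
x = rng.normal(size=5_000)
# Step 2: sample u_i ~ Unif(0, p(x_i)).
u = rng.uniform(0.0, normal_pdf(x))
# Step 3: the pairs (x_i, u_i) are uniform in the region under the density curve.
```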

We can also generate uniform samples in the region enclosed by an unnormalised density function $g(x) = c p(x)$ (any positive function which has a finite integral):

These samples were obtained analogously, the only difference being suitably scaling up the uniform samples:

  1. First sampling an $x_i$ from the blue probability density function.
  2. Then sampling a $u_i$ from the uniform distribution $\text{Unif}(0, g(x_i))$.
  3. The pair $(x_i, u_i)$ is then uniformly distributed in the green region enclosed by $g(x)$.

Rejection Sampling

Suppose that we are now given samples $(x_i, u_i)$ that are uniformly distributed in the region enclosed by an unnormalised probability density function $g(x) = cp(x)$. Then, by our previous sampling arguments, the samples $x_i$ are distributed according to density function $p(x)$.

This is the basis of how rejection sampling works:

  1. Generate samples $(x_i, u_i)$ that are distributed uniformly in the region enclosed by the unnormalised density function $g(x) = cp(x)$.
  2. Then the samples $x_i$ are distributed according to $p(x)$.

The question is then: how do we generate samples that are uniformly distributed in this region without being able to sample from $p(x)$? This is where the "rejection" part of "rejection sampling" comes in.

The basic idea is to first generate uniform samples in a larger region that encloses the unnormalised density function $g(x)$ and then throw away (or "reject") the samples that do not lie below $g(x)$:

More formally now, rejection sampling works as follows:

  1. Identify a density function $q(x)$ which you can sample from, and a constant $k$ such that $k q(x)$ encloses $g(x) = c p(x)$. "Enclose" really means $kq(x) \geq cp(x)$ for every $x$.
  2. Sample uniformly in the region enclosed by $kq(x)$ by first sampling an $x_i$ distributed according to $q$ and then sample a $u_i$ distributed according to $\text{Unif}(0, kq(x_i))$. The pairs of samples $(x_i, u_i)$ are then uniformly distributed in the region enclosed by $kq(x)$.
  3. Discard the samples $(x_i , u_i)$ that are not also enclosed by $g(x)$. That is, if $u_i > g(x_i)$, then discard.
  4. The remaining samples $(x_i, u_i)$ are now distributed in the region enclosed by $g(x)$ and the samples $x_i$ are distributed according to $p$.
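The four steps above can be sketched as a generic rejection sampler. As a worked example (my choice, not from the original notebook), it draws from the density proportional to $g(x) = x(1-x)$ on $[0,1]$ (a $\text{Beta}(2,2)$ distribution) using a $\text{Unif}(0,1)$ envelope with $k = \max_x g(x) = 1/4$:

```python
import numpy as np

def rejection_sample(g, sample_q, q_pdf, k, n_proposals, rng):
    """Draw from the density proportional to g, using the envelope k * q >= g."""
    x = sample_q(n_proposals, rng)      # step 2a: x_i ~ q
    u = rng.uniform(0.0, k * q_pdf(x))  # step 2b: u_i ~ Unif(0, k q(x_i))
    return x[u <= g(x)]                 # step 3: discard points above g

rng = np.random.default_rng(4)
samples = rejection_sample(
    g=lambda x: x * (1.0 - x),                       # unnormalised Beta(2, 2)
    sample_q=lambda n, rng: rng.uniform(size=n),     # proposal: Unif(0, 1)
    q_pdf=lambda x: np.ones_like(x),                 # Unif(0, 1) density
    k=0.25,                                          # max of g, so k q >= g
    n_proposals=100_000,
    rng=rng,
)
```

The accepted fraction is $c/k = (1/6)/(1/4) = 2/3$, illustrating the efficiency issue: a looser envelope wastes more proposals.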

The density function $q(x)$ is called the enveloping distribution, since it "envelops" the $g(x)$. There are two main issues with rejection sampling:

  1. How to actually find a probability density function $q(x)$ and constant $k$ such that $kq(x)$ is larger than $cp(x)$ for every $x$.
  2. How to ensure that this pair $k$ and $q(x)$ is "efficient". By this, I mean we don't have to discard too many samples.

The second issue is the reason why rejection sampling is not used in higher dimensions.

Using Rejection Sampling to Sample from the Posterior

Let's now use rejection sampling to sample from the posterior.

We will pick $q(x)$ to be the prior. This is a natural choice here: the unnormalised posterior is the prior multiplied by the likelihood, so taking $k$ to be any upper bound on the likelihood guarantees $kq(\lambda) \geq p(\underline{y}|\lambda)p(\lambda)$ for every $\lambda$:

We obtain the "accepted" samples as follows:
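The code cell is not shown; a self-contained sketch, assuming the same illustrative synthetic setup as earlier, with $q$ the prior and $k$ the likelihood evaluated at its maximiser (the MLE $\hat\lambda = n/\sum_i y_i$), is:

```python
import numpy as np

rng = np.random.default_rng(5)
true_rate, n_obs = 2.0, 30
y_data = rng.exponential(scale=1.0 / true_rate, size=n_obs)
mu, sd = 0.0, 1.0  # prior parameters

def log_likelihood(rate):
    rate = np.asarray(rate, dtype=float)
    return y_data.size * np.log(rate) - rate * y_data.sum()

# With q = prior, g(lambda) / q(lambda) is exactly the likelihood, so the
# likelihood at the MLE lambda_hat = n / sum(y_i) is a valid k.
mle = y_data.size / y_data.sum()
log_k = log_likelihood(mle)

# Propose from the prior and accept if u_i <= g(x_i) / (k q(x_i)); the test is
# done in log-space since the likelihood values are extremely small.
proposals = np.exp(rng.normal(mu, sd, size=200_000))
log_u = np.log(rng.uniform(size=proposals.size))
accepted = proposals[log_u <= log_likelihood(proposals) - log_k]
```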

There are many extensions to rejection sampling, such as adaptive rejection sampling which aims to automatically find an appropriate choice of $q$ and $k$.

Using Rejection Sampling to Estimate the Posterior Mean

Since we now have posterior samples, we can estimate the posterior mean using Monte-Carlo integration:

$$ \mathbb{E}[\theta | \underline{y}] \approx \frac{1}{n}\sum_{i=1}^n \theta_i, $$

where the $\theta_i$ are samples from $p(\theta|\underline{y})$.

This is really simple to compute:
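The final cell is not shown; the estimate is just the sample mean of the accepted draws, optionally with its Monte-Carlo standard error. The stand-in samples below are hypothetical (in the notebook they would be the accepted rejection-sampling draws):

```python
import numpy as np

# Hypothetical stand-in for the accepted posterior samples from the previous cell.
rng = np.random.default_rng(6)
posterior_samples = rng.gamma(shape=31.0, scale=1.0 / 15.0, size=10_000)

posterior_mean_estimate = posterior_samples.mean()
# Monte-Carlo standard error of the posterior mean estimate:
std_error = posterior_samples.std(ddof=1) / np.sqrt(posterior_samples.size)
```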